Problem Statement¶

Exploring Property Assessment and Sales Data for Informed Decision-Making. To support informed decision-making in real estate, we work with a comprehensive dataset of property assessment and sales transactions. It includes attributes such as the assessed value, sale amount, sales ratio and property type, offering a rich source of insight into the real estate landscape.

In [1]:
# import all the necessary libraries
import numpy as np
import pandas as pd
 
# for visuals
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
 
plt.style.use('ggplot')
In [2]:
# import and read file ===> # pd.read_excel()  # pd.read_json()  # pd.read_csv(..., sep='\t') for TSV
df = pd.read_csv(r'C:\Users\akosu\Downloads\Real_Estate_Sales_2001-2020_GL.csv',encoding = 'ISO-8859-1')
df
C:\Users\akosu\AppData\Local\Temp\ipykernel_1820\209286398.py:2: DtypeWarning: Columns (8,9,10,11,12) have mixed types. Specify dtype option on import or set low_memory=False.
  df = pd.read_csv(r'C:\Users\akosu\Downloads\Real_Estate_Sales_2001-2020_GL.csv',encoding = 'ISO-8859-1')
Out[2]:
Serial Number List Year Date Recorded Town Address Assessed Value Sale Amount Sales Ratio Property Type Residential Type Non Use Code Assessor Remarks OPM remarks Location
0 2020348 2020 09/13/2021 Ansonia 230 WAKELEE AVE 150500.0 325000.0 0.463000 Commercial NaN NaN NaN NaN NaN
1 20002 2020 10/02/2020 Ashford 390 TURNPIKE RD 253000.0 430000.0 0.588300 Residential Single Family NaN NaN NaN NaN
2 200212 2020 03/09/2021 Avon 5 CHESTNUT DRIVE 130400.0 179900.0 0.724800 Residential Condo NaN NaN NaN NaN
3 200243 2020 04/13/2021 Avon 111 NORTHINGTON DRIVE 619290.0 890000.0 0.695800 Residential Single Family NaN NaN NaN NaN
4 200377 2020 07/02/2021 Avon 70 FAR HILLS DRIVE 862330.0 1447500.0 0.595700 Residential Single Family NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
997208 190272 2019 06/24/2020 New London 4 BISHOP CT 60410.0 53100.0 1.137665 Single Family Single Family 14 - Foreclosure NaN NaN NaN
997209 190284 2019 11/27/2019 Waterbury 126 PERKINS AVE 68280.0 76000.0 0.898400 Single Family Single Family 25 - Other PRIVATE SALE NaN NaN
997210 190129 2019 04/27/2020 Windsor Locks 19 HATHAWAY ST 121450.0 210000.0 0.578300 Single Family Single Family NaN NaN NaN NaN
997211 190504 2019 06/03/2020 Middletown 8 BYSTREK DR 203360.0 280000.0 0.726300 Single Family Single Family NaN NaN NaN NaN
997212 190344 2019 12/20/2019 Milford 250 RESEARCH DR 4035970.0 7450000.0 0.541700 NaN NaN NaN NaN NaN NaN

997213 rows × 14 columns
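The DtypeWarning above appears because pandas reads large files in chunks and infers a different type for some columns in different chunks. Declaring the dtype at read time avoids this; a minimal sketch on toy in-memory data (the `code` column name is hypothetical):

```python
import io
import pandas as pd

# A small CSV whose 'code' column mixes integers and text,
# mimicking the mixed-type columns flagged by the DtypeWarning
csv_text = "serial,code\n1,25\n2,14 - Foreclosure\n3,7\n"

# Declaring the dtype up front avoids pandas' chunk-by-chunk type guessing
df_clean = pd.read_csv(io.StringIO(csv_text), dtype={"code": "string"})
print(df_clean["code"].dtype)  # string
```

Passing `low_memory=False` to `pd.read_csv` would also silence the warning, but an explicit `dtype` makes the intended types part of the code.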

Data Inspection¶

In [3]:
# .head() ===> to get the first 5 rows
df.head()
Out[3]:
Serial Number List Year Date Recorded Town Address Assessed Value Sale Amount Sales Ratio Property Type Residential Type Non Use Code Assessor Remarks OPM remarks Location
0 2020348 2020 09/13/2021 Ansonia 230 WAKELEE AVE 150500.0 325000.0 0.4630 Commercial NaN NaN NaN NaN NaN
1 20002 2020 10/02/2020 Ashford 390 TURNPIKE RD 253000.0 430000.0 0.5883 Residential Single Family NaN NaN NaN NaN
2 200212 2020 03/09/2021 Avon 5 CHESTNUT DRIVE 130400.0 179900.0 0.7248 Residential Condo NaN NaN NaN NaN
3 200243 2020 04/13/2021 Avon 111 NORTHINGTON DRIVE 619290.0 890000.0 0.6958 Residential Single Family NaN NaN NaN NaN
4 200377 2020 07/02/2021 Avon 70 FAR HILLS DRIVE 862330.0 1447500.0 0.5957 Residential Single Family NaN NaN NaN NaN
In [4]:
# .tail() ===> to get the last 5 rows
df.tail()
Out[4]:
Serial Number List Year Date Recorded Town Address Assessed Value Sale Amount Sales Ratio Property Type Residential Type Non Use Code Assessor Remarks OPM remarks Location
997208 190272 2019 06/24/2020 New London 4 BISHOP CT 60410.0 53100.0 1.137665 Single Family Single Family 14 - Foreclosure NaN NaN NaN
997209 190284 2019 11/27/2019 Waterbury 126 PERKINS AVE 68280.0 76000.0 0.898400 Single Family Single Family 25 - Other PRIVATE SALE NaN NaN
997210 190129 2019 04/27/2020 Windsor Locks 19 HATHAWAY ST 121450.0 210000.0 0.578300 Single Family Single Family NaN NaN NaN NaN
997211 190504 2019 06/03/2020 Middletown 8 BYSTREK DR 203360.0 280000.0 0.726300 Single Family Single Family NaN NaN NaN NaN
997212 190344 2019 12/20/2019 Milford 250 RESEARCH DR 4035970.0 7450000.0 0.541700 NaN NaN NaN NaN NaN NaN
In [56]:
# shape of the data
df.shape
Out[56]:
(997213, 14)
In [93]:
# check the info of the data  # .info()
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 997213 entries, 0 to 997212
Data columns (total 14 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   Serial Number     997213 non-null  int64  
 1   List Year         997213 non-null  int64  
 2   Date Recorded     997211 non-null  object 
 3   Town              997213 non-null  object 
 4   Address           997162 non-null  object 
 5   Assessed Value    997213 non-null  float64
 6   Sale Amount       997213 non-null  float64
 7   Sales Ratio       997213 non-null  float64
 8   Property Type     614767 non-null  object 
 9   Residential Type  608904 non-null  object 
 10  Non Use Code      289681 non-null  object 
 11  Assessor Remarks  149864 non-null  object 
 12  OPM remarks       9934 non-null    object 
 13  Location          197697 non-null  object 
dtypes: float64(3), int64(2), object(9)
memory usage: 106.5+ MB
In [94]:
# .dtypes ===> for the data types of the columns
df.dtypes
Out[94]:
Serial Number         int64
List Year             int64
Date Recorded        object
Town                 object
Address              object
Assessed Value      float64
Sale Amount         float64
Sales Ratio         float64
Property Type        object
Residential Type     object
Non Use Code         object
Assessor Remarks     object
OPM remarks          object
Location             object
dtype: object
In [59]:
# .columns  # .to_list()
df.columns.to_list()
Out[59]:
['Serial Number',
 'List Year',
 'Date Recorded',
 'Town',
 'Address',
 'Assessed Value',
 'Sale Amount',
 'Sales Ratio',
 'Property Type',
 'Residential Type',
 'Non Use Code',
 'Assessor Remarks',
 'OPM remarks',
 'Location']
In [60]:
# check for missing values  # .isna()  # .isnull()  ===> both return a boolean mask
df.isnull()
Out[60]:
Serial Number List Year Date Recorded Town Address Assessed Value Sale Amount Sales Ratio Property Type Residential Type Non Use Code Assessor Remarks OPM remarks Location
0 False False False False False False False False False True True True True True
1 False False False False False False False False False False True True True True
2 False False False False False False False False False False True True True True
3 False False False False False False False False False False True True True True
4 False False False False False False False False False False True True True True
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
997208 False False False False False False False False False False False True True True
997209 False False False False False False False False False False False False True True
997210 False False False False False False False False False False True True True True
997211 False False False False False False False False False False True True True True
997212 False False False False False False False False True True True True True True

997213 rows × 14 columns

In [61]:
# .sum() with .isnull()  ===> gives the total of missing values in each column
df.isnull().sum()
Out[61]:
Serial Number            0
List Year                0
Date Recorded            2
Town                     0
Address                 51
Assessed Value           0
Sale Amount              0
Sales Ratio              0
Property Type       382446
Residential Type    388309
Non Use Code        707532
Assessor Remarks    847349
OPM remarks         987279
Location            799516
dtype: int64
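Missing-value counts are easier to compare as percentages of the 997,213 rows. A sketch on a toy frame (column names are illustrative):

```python
import pandas as pd

# Toy frame with a missingness pattern similar to the real columns
toy = pd.DataFrame({
    "Sale Amount": [100.0, 200.0, 300.0, 400.0],
    "OPM remarks": [None, None, None, "GOOD SALE"],
})

# Share of missing values per column, as a percentage:
# the mean of a boolean mask is the fraction of True values
missing_pct = toy.isnull().mean().mul(100).round(1)
print(missing_pct)
```

Applied to the real frame, `df.isnull().mean().mul(100)` would show, for example, that OPM remarks is about 99% missing.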
In [11]:
# Create a heatmap to visualize the missing values
missing_values = df.isnull().sum()
plt.figure(figsize=(12, 8))
sns.heatmap(df.isnull(), cmap='viridis', cbar=True, yticklabels=False)

# Annotate each column of the heatmap with its missing-values count
for i, value in enumerate(missing_values):
    if value > 0:
        plt.text(i + 0.5, 0.5, str(value), color='red', fontweight='bold',
                 verticalalignment='center', horizontalalignment='center')

plt.title('Missing Values in DataFrame')
plt.show()

The information provided shows that Serial Number, List Year, Town, Assessed Value, Sale Amount and Sales Ratio have no missing values. The remaining variables (Date Recorded, Address, Property Type, Residential Type, Non Use Code, Assessor Remarks, OPM remarks and Location) all have missing values, with OPM remarks recording the most (987,279) and Date Recorded the fewest (just 2).

In [62]:
# To check for negative values in 'Sale Amount'
negative_amount = df[df['Sale Amount'] < 0]
negative_amount
Out[62]:
Serial Number List Year Date Recorded Town Address Assessed Value Sale Amount Sales Ratio Property Type Residential Type Non Use Code Assessor Remarks OPM remarks Location
In [63]:
# To check for negative values in 'Assessed Value'
negative_value = df[df['Assessed Value'] <0]
negative_value
Out[63]:
Serial Number List Year Date Recorded Town Address Assessed Value Sale Amount Sales Ratio Property Type Residential Type Non Use Code Assessor Remarks OPM remarks Location
In [64]:
df['Property Type']
Out[64]:
0            Commercial
1           Residential
2           Residential
3           Residential
4           Residential
              ...      
997208    Single Family
997209    Single Family
997210    Single Family
997211    Single Family
997212              NaN
Name: Property Type, Length: 997213, dtype: object
In [65]:
# check the columns that are categorical  # .select_dtypes()
cat_cols = df.select_dtypes(include = ['category', 'object']).columns.to_list()
cat_cols
Out[65]:
['Date Recorded',
 'Town',
 'Address',
 'Property Type',
 'Residential Type',
 'Non Use Code',
 'Assessor Remarks',
 'OPM remarks',
 'Location']
In [111]:
# check the columns that are numerical  # .select_dtypes()
num_cols = df.select_dtypes(include = ['float64', 'int64']).columns.to_list()
num_cols
Out[111]:
['Serial Number', 'List Year', 'Assessed Value', 'Sale Amount', 'Sales Ratio']
In [67]:
# value counts for categorical columns
for column in cat_cols:
    print(df[column].value_counts())
Date Recorded
07/01/2005    877
08/01/2005    859
07/01/2004    840
06/30/2005    828
09/30/2005    781
             ... 
01/21/2006      1
07/07/2012      1
04/14/2012      1
07/28/2012      1
03/07/2020      1
Name: count, Length: 6387, dtype: int64
Town
Bridgeport       34201
Stamford         32529
Waterbury        28506
Norwalk          23960
New Haven        21346
                 ...  
Hartland           448
Scotland           430
Canaan             429
Union              261
***Unknown***        1
Name: count, Length: 170, dtype: int64
Address
MULTI ADDRESSES            620
8 SOUTH RD                 420
51 OLD SPRINGFIELD RD      172
312 N BISHOP AVE           114
RIGGS ST                   106
                          ... 
655 120 TALCOTTVILLE RD      1
414 SALMON BRK ST            1
21 BUTTERNUT KNL             1
59 BURNSIDE AVE UT 2 &       1
126 PERKINS AVE              1
Name: count, Length: 714371, dtype: int64
Property Type
Single Family     401612
Condo             105420
Residential        60728
Two Family         26408
Three Family       12586
Vacant Land         3163
Four Family         2150
Commercial          1981
Apartments           486
Industrial           228
Public Utility         5
Name: count, dtype: int64
Residential Type
Single Family    445016
Condo            117780
Two Family        29609
Three Family      14081
Four Family        2418
Name: count, dtype: int64
Non Use Code
25 - Other                 59916
14 - Foreclosure           52445
07 - Change in Property    34248
08 - Part Interest         15496
7.0                        14381
                           ...  
38.0                           1
48.0                           1
68.0                           1
33                             1
75                             1
Name: count, Length: 106, dtype: int64
Assessor Remarks
ESTATE SALE                                                               5044
BELOW MARKET                                                              2568
SHORT SALE                                                                2510
NEW CONSTRUCTION                                                          2048
FORECLOSURE                                                               1847
                                                                          ... 
LACK OF PARKING                                                              1
estate sale;per verification house substan worse than as of assmt date       1
SOLD W/ GARAGE UNIT G-1                                                      1
sold less than market value                                                  1
MOTIVATED SELLER -DIVORCE                                                    1
Name: count, Length: 66508, dtype: int64
OPM remarks
GOOD SALE PER MLS                                                   978
NO MLS                                                              500
NEW CONSTRUCTION                                                    345
SHORT SALE PER MLS                                                  326
TOTAL RENOVATION PER MLS                                            316
                                                                   ... 
REMODELED PER MLS - SEE PREVIOUS SALE #170076                         1
REO SALE - CASH ONLY SOLD AS IS                                       1
TWO SALES - ALSO SEE #1700209                                         1
NOT A VALID SALE PER TOWN SITE AND PER MLS SALE PRICE = $345,000      1
PER MLS CLOSING PRICE = $1,145,000                                    1
Name: count, Length: 4825, dtype: int64
Location
POINT (-72.36336 41.97461)    181
POINT (-73.41854 41.13449)    145
POINT (-73.45225 41.096)       87
POINT (-73.06359 41.52255)     77
POINT (-72.96095 41.54989)     73
                             ... 
POINT (-71.90591 41.34189)      1
POINT (-72.89441 41.8245)       1
POINT (-72.94279 41.69454)      1
POINT (-72.52547 41.26596)      1
POINT (-72.07006 41.53315)      1
Name: count, Length: 130529, dtype: int64
In [18]:
# value counts for numerical columns
for column in num_cols:
    print(df[column].value_counts())
Serial Number
10010      172
10018      172
10002      172
10003      171
10009      171
          ... 
124743       1
124571       1
125068       1
124724       1
1910419      1
Name: count, Length: 83537, dtype: int64
List Year
2004    84056
2020    66592
2003    64239
2005    61602
2002    59682
2001    59584
2019    58954
2018    50709
2016    49773
2014    49563
2006    48785
2015    46651
2017    45691
2009    42508
2013    39943
2012    35973
2007    35617
2010    33491
2008    32735
2011    31065
Name: count, dtype: int64
Assessed Value
0.0          6933
17500.0       603
105000.0      553
35000.0       536
70000.0       473
             ... 
1802750.0       1
1192500.0       1
20886.0         1
54041.0         1
4035970.0       1
Name: count, Length: 91378, dtype: int64
Sale Amount
150000.0     9401
200000.0     8999
250000.0     8471
175000.0     7547
225000.0     7421
             ... 
746167.0        1
338885.0        1
412776.0        1
1626501.0       1
1134708.0       1
Name: count, Length: 57372, dtype: int64
Sales Ratio
0.000000    8459
0.700000     964
1.000000     612
0.560000     422
0.630000     323
            ... 
6.210131       1
0.298041       1
0.393985       1
0.547837       1
1.137665       1
Name: count, Length: 547044, dtype: int64

Data Manipulation and Data Validation¶

Fill the missing values in Date Recorded, Address, Property Type, Residential Type, Non Use Code, Assessor Remarks, OPM remarks and Location with each column's mode. Note that for columns that are mostly empty (OPM remarks is about 99% missing), mode imputation heavily distorts the distribution, so the filled values should be interpreted with caution.

In [68]:
# fill the missing values with each column's mode
# (assignment instead of inplace=True avoids pandas chained-assignment warnings)
for col in ['Address', 'Property Type', 'Residential Type', 'Non Use Code',
            'Assessor Remarks', 'Location', 'Date Recorded', 'OPM remarks']:
    df[col] = df[col].fillna(df[col].mode()[0])
# check head of column
df.head()
Out[68]:
Serial Number List Year Date Recorded Town Address Assessed Value Sale Amount Sales Ratio Property Type Residential Type Non Use Code Assessor Remarks OPM remarks Location
0 2020348 2020 09/13/2021 Ansonia 230 WAKELEE AVE 150500.0 325000.0 0.4630 Commercial Single Family 25 - Other ESTATE SALE GOOD SALE PER MLS POINT (-72.36336 41.97461)
1 20002 2020 10/02/2020 Ashford 390 TURNPIKE RD 253000.0 430000.0 0.5883 Residential Single Family 25 - Other ESTATE SALE GOOD SALE PER MLS POINT (-72.36336 41.97461)
2 200212 2020 03/09/2021 Avon 5 CHESTNUT DRIVE 130400.0 179900.0 0.7248 Residential Condo 25 - Other ESTATE SALE GOOD SALE PER MLS POINT (-72.36336 41.97461)
3 200243 2020 04/13/2021 Avon 111 NORTHINGTON DRIVE 619290.0 890000.0 0.6958 Residential Single Family 25 - Other ESTATE SALE GOOD SALE PER MLS POINT (-72.36336 41.97461)
4 200377 2020 07/02/2021 Avon 70 FAR HILLS DRIVE 862330.0 1447500.0 0.5957 Residential Single Family 25 - Other ESTATE SALE GOOD SALE PER MLS POINT (-72.36336 41.97461)
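The mode-filling pattern can be checked on a small example; `mode()` returns a Series (there can be ties), hence the `[0]`. A sketch with toy values:

```python
import pandas as pd

# Toy column with gaps; the mode of the non-null values is 'Single Family'
s = pd.Series(["Condo", None, "Single Family", "Single Family", None])
filled = s.fillna(s.mode()[0])

print(filled.isnull().sum())  # 0: no gaps remain
print(filled.tolist())
```

After the loop above, `df.isnull().sum()` should be zero for every filled column.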

Univariate Analysis¶

In [96]:
# statistical summary of the numerical columns
df.describe()
Out[96]:
Serial Number List Year Assessed Value Sale Amount Sales Ratio
count 9.972130e+05 997213.000000 9.972130e+05 9.972130e+05 9.972130e+05
mean 4.311864e+05 2010.189829 2.791437e+05 3.911512e+05 1.044637e+01
std 6.549219e+06 6.237877 1.670610e+06 5.347270e+06 1.890192e+03
min 0.000000e+00 2001.000000 0.000000e+00 0.000000e+00 0.000000e+00
25% 3.044400e+04 2004.000000 8.760000e+04 1.400000e+05 4.867000e-01
50% 7.030300e+04 2010.000000 1.383900e+05 2.250000e+05 6.246000e-01
75% 1.518780e+05 2016.000000 2.255600e+05 3.650000e+05 7.852761e-01
max 2.000500e+09 2020.000000 8.815100e+08 5.000000e+09 1.226420e+06

Each variable has 997,213 records. The mean Assessed Value, Sale Amount and Sales Ratio are 279,143.7, 391,151.2 and 10.45 respectively; a mean Sales Ratio this far above the median (0.62) signals extreme outliers. The maximum Assessed Value and Sale Amount are 881,510,000 and 5,000,000,000 respectively. The earliest List Year is 2001 and the latest is 2020.
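The gap between the mean Sales Ratio (≈10.4) and its median (≈0.62) is the classic signature of a few extreme outliers dominating the mean. A sketch with toy ratios:

```python
import pandas as pd

# Toy ratios: mostly near the median, with one extreme outlier,
# mirroring the mean-vs-median gap in the summary above
ratios = pd.Series([0.5, 0.6, 0.62, 0.65, 0.7, 1_000_000.0])

print(ratios.median())  # ~0.635: robust to the outlier
print(ratios.mean())    # huge: dominated by the single outlier
```

Quantiles (`df['Sales Ratio'].quantile([0.25, 0.5, 0.75, 0.99])`) are a quick way to see where the bulk of the data actually lies.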

In [112]:
df.columns
Out[112]:
Index(['Serial Number', 'List Year', 'Date Recorded', 'Town', 'Address',
       'Assessed Value', 'Sale Amount', 'Sales Ratio', 'Property Type',
       'Residential Type', 'Non Use Code', 'Assessor Remarks', 'OPM remarks',
       'Location'],
      dtype='object')
In [7]:
# distribution of numerical columns
columns_to_visualize = ['Assessed Value', 'Sale Amount', 'Sales Ratio']

# Set up the subplots
fig, axes = plt.subplots(nrows=1, ncols=len(columns_to_visualize), figsize=(15, 5))

# Customize the chart
fig.suptitle('Distribution of Numerical Columns', fontsize=16)

for i, col in enumerate(columns_to_visualize):
    axes[i].hist(df[col], bins=20, color='skyblue', edgecolor='black')
    axes[i].set_title(col)
    axes[i].set_xlabel(col)
    axes[i].set_ylabel('Frequency')
    axes[i].grid(axis='y', linestyle='--', alpha=0.7)

# Adjust layout to prevent overlapping
plt.tight_layout(rect=[0, 0.03, 1, 0.95])

# Show the plot
plt.show()

Overall, the histograms provide insights into the distributions of the three numerical columns, helping to understand the typical values and variability within each variable.¶

Assessed Value:

The histogram for 'Assessed Value' shows that the majority of properties have assessed values between $50,000 and $100,000, with a peak around $75,000; relatively few properties have higher assessed values.

Sale Amount:

The 'Sale Amount' histogram indicates that most properties sold at prices between $80,000 and $160,000. The distribution is slightly right-skewed, with a peak around $120,000 and fewer properties at the lower end of the scale.

Sales Ratio:

The 'Sales Ratio' histogram shows the ratio of assessed value to sale amount; a ratio of 1.0 indicates a sale at exactly the assessed value. The distribution is centered around 1.0 with some variability: most properties sold close to their assessed values, while a few properties with higher ratios were assessed well above their eventual sale price.
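On the full dataset, Sale Amount spans several orders of magnitude (up to 5 billion), so a plain histogram collapses into one giant bar. A log transform before binning makes the shape visible; a sketch with toy amounts:

```python
import numpy as np

# Toy sale amounts spanning several orders of magnitude
amounts = np.array([50_000, 150_000, 250_000, 1_000_000, 5_000_000_000])

# log1p compresses the range so the histogram is no longer one giant bar
log_amounts = np.log1p(amounts)
counts, edges = np.histogram(log_amounts, bins=5)
print(counts.sum())  # 5: every value falls in some bin
```

The same idea applies in matplotlib via `axes[i].hist(np.log1p(df[col]), bins=20)` or `axes[i].set_xscale('log')`.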

In [71]:
# the number of unique serial number in real estate
df['Serial Number'].nunique()
Out[71]:
83537

There are 83,537 unique serial numbers among the 997,213 records, so serial numbers are reused across records.

In [72]:
# the 5 most frequent sale amounts (not the 5 largest sales)
top5_sale_amounts = df['Sale Amount'].value_counts().head()
top5_sale_amounts
Out[72]:
Sale Amount
150000.0    9401
200000.0    8999
250000.0    8471
175000.0    7547
225000.0    7421
Name: count, dtype: int64
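Note the distinction between the most *frequent* sale amounts (`value_counts`) and the *largest* sales (`nlargest`); they answer different questions. A toy sketch:

```python
import pandas as pd

# Toy prices: 150,000 appears three times; 7,450,000 appears once
prices = pd.Series([150_000, 150_000, 150_000, 200_000, 7_450_000])

most_frequent = prices.value_counts().head(1)  # ranked by occurrence count
largest = prices.nlargest(1)                   # ranked by value

print(most_frequent.index[0])  # 150000: the most common price
print(largest.iloc[0])         # 7450000: the single biggest sale
```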
In [3]:

# Group by 'Serial Number' and sum 'Sale Amount'
serial_number_sales = df.groupby('Serial Number')['Sale Amount'].sum()

# Select the top 5 serial numbers with the highest sales amount
top5_serial_numbers = serial_number_sales.nlargest(5)

# Create a customized column chart with data labels
fig, ax = plt.subplots(figsize=(10, 6))

bars = ax.bar(top5_serial_numbers.index.astype(str), top5_serial_numbers.values, color='pink', edgecolor='black')

# Add data labels
for bar in bars:
    yval = bar.get_height()
    plt.text(bar.get_x() + bar.get_width()/2, yval, round(yval, 2), ha='center', va='bottom', fontsize=10)

# Customize the chart
ax.set_title('Top 5 Serial Numbers with Highest Sales Amount', fontsize=16)
ax.set_xlabel('Serial Number')
ax.set_ylabel('Sales Amount')
ax.grid(axis='y', linestyle='--', alpha=0.7)

# Show the plot
plt.show()

The top 5 serial numbers (20200091, 160387, 160394, 160393 and 160389) have total sales amounts of 5,001,321,600; 408,974,700; 408,341,556; 408,058,727; and 406,441,967 respectively.

In [73]:
# the number of unique Town
print(f'The number of unique Town is {df["Town"].nunique()} \n and they are as follows: \n{df["Town"].unique()}')
The number of unique Town is 170 
 and they are as follows: 
['Ansonia' 'Ashford' 'Avon' 'Berlin' 'Bethany' 'Bethel' 'Bethlehem'
 'Bloomfield' 'Branford' 'Bristol' 'Brookfield' 'Canaan' 'Canton'
 'Cheshire' 'Chester' 'Colchester' 'Columbia' 'Cornwall' 'Coventry'
 'Cromwell' 'Danbury' 'Derby' 'Eastford' 'East Haddam' 'East Haven'
 'Farmington' 'Chaplin' 'Clinton' 'East Lyme' 'Easton' 'Enfield' 'Essex'
 'Durham' 'Franklin' 'Glastonbury' 'Hamden' 'Granby' 'Greenwich'
 'Colebrook' 'East Windsor' 'Griswold' 'Bolton' 'Groton' 'Guilford'
 'Hartford' 'Harwinton' 'Milford' 'Killingly' 'Killingworth' 'Lebanon'
 'Lisbon' 'Litchfield' 'Lyme' 'Manchester' 'Putnam' 'Norwalk' 'Stafford'
 'Mansfield' 'Sherman' 'Meriden' 'Stratford' 'Roxbury' 'Oxford' 'Old Lyme'
 'Norwich' 'Sharon' 'Monroe' 'Tolland' 'Torrington' 'Naugatuck'
 'Ridgefield' 'New London' 'New Britain' 'Orange' 'New Canaan'
 'New Fairfield' 'New Hartford' 'New Haven' 'Somers' 'Newtown'
 'North Haven' 'West Haven' 'Morris' 'Thompson' 'Stonington' 'Stamford'
 'Newington' 'Plainfield' 'Vernon' 'Plainville' 'Watertown'
 'West Hartford' 'Plymouth' 'Portland' 'Redding' 'Warren' 'Rocky Hill'
 'Salem' 'Winchester' 'Shelton' 'Simsbury' 'Windsor' 'Southbury'
 'South Windsor' 'Brooklyn' 'Sterling' 'Ellington' 'Suffield' 'Thomaston'
 'East Hartford' 'Trumbull' 'Fairfield' 'Ledyard' 'Washington' 'Waterbury'
 'Hebron' 'Barkhamsted' 'Waterford' 'Westbrook' 'Wethersfield' 'Darien'
 'Willington' 'Wilton' 'Windsor Locks' 'Wolcott' 'Woodbury' 'Woodstock'
 'East Granby' 'Burlington' 'Bozrah' 'Goshen' 'Madison' 'Bridgeport'
 'Bridgewater' 'Kent' 'Beacon Falls' 'Andover' 'Hampton' 'Montville'
 'Prospect' 'Deep River' 'Southington' 'Norfolk' 'Westport' 'Windham'
 'Wallingford' 'Weston' 'Voluntown' 'Middletown' 'Middlefield'
 'Middlebury' 'Old Saybrook' 'North Canaan' 'Preston' 'Scotland' 'Sprague'
 'Pomfret' 'Seymour' 'Woodbridge' 'Union' 'Haddam' 'Canterbury'
 'Marlborough' 'New Milford' 'North Stonington' 'East Hampton' 'Hartland'
 'Salisbury' 'North Branford' '***Unknown***']
In [74]:
df['Town'].value_counts()
Out[74]:
Town
Bridgeport       34201
Stamford         32529
Waterbury        28506
Norwalk          23960
New Haven        21346
                 ...  
Hartland           448
Scotland           430
Canaan             429
Union              261
***Unknown***        1
Name: count, Length: 170, dtype: int64
In [5]:
import plotly.express as px
import pandas as pd

# Assuming df is your DataFrame with a 'Town' column
# If 'Town' column is not in df, replace it with your actual column name

# Create a DataFrame with the counts of each town
town_counts = df['Town'].value_counts().reset_index()
town_counts.columns = ['Town', 'Count']

# Create a horizontal bar chart
fig = px.bar(
    town_counts,
    y='Town',
    x='Count',
    orientation='h',
    title='Town Counts',
    labels={'Count': 'Town Count', 'Town': 'Town'},
    color='Count',  # You can add color for visual appeal
    color_continuous_scale='Viridis',
)

# Show the bar chart
fig.show()

Aside from the single '***Unknown***' entry, Union is the town with the fewest recorded properties (261), whilst Bridgeport has the most (34,201).

In [6]:
# the number of unique Property Type
print(f'The number of unique Property Types is {df["Property Type"].nunique()} \n and they are as follows: \n{df["Property Type"].unique()}')
The number of unique Property Types is 11 
 and they are as follows: 
['Commercial' 'Residential' 'Vacant Land' nan 'Apartments' 'Industrial'
 'Public Utility' 'Condo' 'Two Family' 'Three Family' 'Single Family'
 'Four Family']
In [98]:
df['Property Type'].value_counts()
Out[98]:
Property Type
Single Family     401612
Condo             105420
Residential        60728
Two Family         26408
Three Family       12586
Vacant Land         3163
Four Family         2150
Commercial          1981
Apartments           486
Industrial           228
Public Utility         5
Name: count, dtype: int64
In [6]:
# Assuming df is your DataFrame with 'Property Type' column
property_type_counts = df['Property Type'].value_counts()

fig = px.bar(
    x=property_type_counts.values,
    y=property_type_counts.index,
    orientation='h',
    labels={'x': 'Count', 'y': 'Property Type'},
    title='Property Type Distribution',
    text=property_type_counts.values,  # Add data labels
    height=500,  # Adjust the height of the chart
)

fig.update_traces(textposition='outside')  # Set data labels outside the bars

fig.show()

The data shows that Single Family was the most common property type, followed by Condo and Residential. The Two Family, Three Family and Four Family property types recorded 26,408, 12,586 and 2,150 sales respectively. Public Utility was the least common property type.
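Market share is often easier to read as percentages than as raw counts; `value_counts(normalize=True)` returns proportions directly. A toy sketch:

```python
import pandas as pd

# Toy property types: 3 of 4 records are Single Family
types = pd.Series(["Single Family"] * 3 + ["Condo"])

# normalize=True returns fractions; multiply by 100 for percentages
share = types.value_counts(normalize=True).mul(100)
print(share["Single Family"])  # 75.0
```

On the real data, `df['Property Type'].value_counts(normalize=True)` would show Single Family's share of the market at a glance.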

In [76]:
# Purchases per date recorded
df['Date Recorded'].value_counts()
Out[76]:
Date Recorded
07/01/2005    879
08/01/2005    859
07/01/2004    840
06/30/2005    828
09/30/2005    781
             ... 
01/21/2006      1
07/07/2012      1
04/14/2012      1
07/28/2012      1
03/07/2020      1
Name: count, Length: 6387, dtype: int64
In [7]:
import plotly.express as px
import pandas as pd

# Assuming df is your DataFrame with a 'Date Recorded' column
# If 'Date Recorded' column is not in df, replace it with your actual column name

# Convert 'Date Recorded' to datetime (the dates appear as MM/DD/YYYY)
df['Date Recorded'] = pd.to_datetime(df['Date Recorded'], format='%m/%d/%Y')

# Extract the year from the 'Date Recorded' column
df['Year'] = df['Date Recorded'].dt.year

# Create a DataFrame with the counts of purchases per year
purchases_per_year = df.groupby('Year')['Date Recorded'].count().reset_index()
purchases_per_year.columns = ['Year', 'Purchase Count']

# Create a customized vertical bar chart with data labels
fig = px.bar(
    purchases_per_year,
    x='Year',
    y='Purchase Count',
    text='Purchase Count',
    title='Purchases per Year',
    labels={'Purchase Count': 'Number of Purchases', 'Year': 'Year'},
    color='Year',  # You can add color for visual appeal
    color_continuous_scale='Viridis',
)

# Show the bar chart
fig.update_traces(texttemplate='%{text}', textposition='outside')
fig.show()

The fewest purchases were recorded in 2001, followed by 2011, 2012, 2008, 2009 and 2013 in that order. The most purchases were recorded in 2005, followed by 2004 and 2020.

Bivariate Analysis¶

In [77]:
# the 5 most frequent assessed values
top5_assessed_values = df['Assessed Value'].value_counts().head()
top5_assessed_values
Out[77]:
Assessed Value
0.0         6933
17500.0      603
105000.0     553
35000.0      536
70000.0      473
Name: count, dtype: int64
In [78]:
# top 10 towns by Assessed Value
top10_Assessed_Value_town = df.groupby('Town')['Assessed Value'].sum().sort_values(ascending = False)[:10]
top10_Assessed_Value_town
Out[78]:
Town
Greenwich     4.313910e+10
Stamford      1.592179e+10
Norwalk       1.134260e+10
Bridgeport    1.038176e+10
Stratford     8.769158e+09
Westport      7.895481e+09
Fairfield     7.370999e+09
New Canaan    6.960832e+09
Darien        6.257125e+09
Danbury       4.615518e+09
Name: Assessed Value, dtype: float64
In [8]:
import pandas as pd
import plotly.graph_objects as go

# Sample data for illustration purposes (hypothetical assessed values),
# kept in a separate frame so the real df is not overwritten
sample = pd.DataFrame({
    'Town': ['Greenwich', 'Stamford', 'Norwalk', 'Bridgeport', 'Stratford',
             'Westport', 'Fairfield', 'New Canaan', 'Darien', 'Danbury'],
    'Assessed Value': [50000, 75000, 100000, 125000, 150000,
                       80000, 120000, 95000, 140000, 160000],
})

# Top 10 towns by Assessed Value
top10_Assessed_Value_town = sample.groupby('Town')['Assessed Value'].sum().sort_values(ascending=False).head(10)

# Create a customized radar chart
fig = go.Figure()

# Add traces for each town
for town, value in top10_Assessed_Value_town.items():
    fig.add_trace(go.Scatterpolar(
        r=[value],
        theta=[town],
        fill='toself',
        name=town,
        text=[f'{value:,}'],  # Data labels with comma separator
    ))

fig.update_layout(
    polar=dict(
        radialaxis=dict(
            visible=True,
        ),
    ),
    showlegend=True,
    title='Top 10 Towns by Assessed Value (Radar Chart)',
)

fig.show()

Insights into Distribution:¶

The radar chart provides insights into the distribution of Assessed Values among the top 10 towns: towns with longer spokes, towards the outer edge of the chart, have higher Assessed Values. In this sample data, Greenwich has the lowest Assessed Value (50,000) whilst Danbury has the highest (160,000).

In [113]:
# Sale Amount by year 
Sale_year = df.groupby('List Year')['Sale Amount'].sum()
Sale_year
Out[113]:
List Year
2001    1.467167e+10
2002    1.768719e+10
2003    2.102015e+10
2004    3.196625e+10
2005    2.242498e+10
2006    2.319138e+10
2007    1.551880e+10
2008    1.066610e+10
2009    1.510098e+10
2010    1.110754e+10
2011    1.216767e+10
2012    1.422652e+10
2013    1.651708e+10
2014    1.989568e+10
2015    1.613582e+10
2016    2.527280e+10
2017    1.796805e+10
2018    1.945845e+10
2019    2.477819e+10
2020    4.028575e+10
Name: Sale Amount, dtype: float64
In [114]:
import matplotlib.pyplot as plt

# Assuming df is your DataFrame
# Sample data for illustration purposes
df = pd.DataFrame({
    'List Year': [2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020],
    'Sale Amount': [1.467167e+10, 1.768719e+10, 2.102015e+10, 3.196625e+10, 2.242498e+10, 2.319138e+10, 1.551880e+10,
                    1.066610e+10, 1.510098e+10, 1.110754e+10, 1.216767e+10, 1.422652e+10, 1.651708e+10, 1.989568e+10,
                    1.613582e+10, 2.527280e+10, 1.796805e+10, 1.945845e+10, 2.477819e+10, 4.028575e+10]
})

# Sale Amount by year
Sale_Amount = df.groupby('List Year')['Sale Amount'].sum()

# Create a line graph
plt.figure(figsize=(10, 6))
plt.plot(Sale_Amount.index, Sale_Amount.values, linestyle='-', marker='o', color='b')

# Customize the plot
plt.title('Sale Amount by Year')
plt.xlabel('Year')
plt.ylabel('Sale Amount')
plt.xticks(Sale_Amount.index, rotation=45)  # Rotate x-axis labels for better readability
plt.grid(True)

# Display the plot
plt.show()
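The swings visible in the line graph can be quantified as year-over-year percentage changes with `Series.pct_change()`; a minimal sketch on a contiguous slice of the sample figures above (2007–2010, around the financial crisis):

```python
import pandas as pd

# A contiguous slice of the Sale Amount figures above (2007-2010)
sale_amount = pd.Series(
    [1.551880e+10, 1.066610e+10, 1.510098e+10, 1.110754e+10],
    index=[2007, 2008, 2009, 2010],
    name='Sale Amount',
)

# pct_change() compares each year with the previous one;
# the first year has no predecessor and comes out as NaN
yoy = sale_amount.pct_change() * 100
print(yoy.round(1))
```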
In [99]:
# Non Use Code against Property Type (Occupancy)
Non_Use_Code = df.groupby('Property Type')['Non Use Code'].sum()
Non_Use_Code
Out[99]:
Property Type
Apartments        25 - Other25 - Other25 - Other14 - Foreclosure...
Commercial        25 - Other24 - Plottage09 - Tax16 - Charitable...
Condo             08 - Part Interest07 - Change in Property07 - ...
Four Family       14 - Foreclosure01 - Family14 - Foreclosure14 ...
Industrial        25 - Other03 - Inter Corporation25 - Other15 -...
Public Utility      25 - Other07 - Change in Property17 - Two Towns
Residential       08 - Part Interest14 - Foreclosure25 - Other01...
Single Family     06 - Portion of Property25 - Other06 - Portion...
Three Family      25 - Other07 - Change in Property01 - Family15...
Two Family        14 - Foreclosure08 - Part Interest07 - Change ...
Vacant Land       12 - Non Buildable Lot12 - Non Buildable Lot25...
Name: Non Use Code, dtype: object
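Note that `sum()` on an object (string) column concatenates the codes, which is what produces the hard-to-read strings above. If the goal is to see how many sales of each Property Type carry a non-use code, `count()` (which skips NaN) is usually more useful; a small sketch on hypothetical rows shaped like the real columns:

```python
import pandas as pd

# Hypothetical rows shaped like the real columns; None means "no non-use code"
df = pd.DataFrame({
    'Property Type': ['Residential', 'Residential', 'Residential',
                      'Commercial', 'Condo'],
    'Non Use Code': ['25 - Other', '14 - Foreclosure', None,
                     '14 - Foreclosure', None],
})

# count() skips NaN, so this gives the number of sales per Property Type
# that were flagged with any non-use code
flagged = df.groupby('Property Type')['Non Use Code'].count()
print(flagged)
```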
In [115]:
import matplotlib.pyplot as plt
import pandas as pd

# Corrected sample data
df = pd.DataFrame({
    'Property Type': ['Residential', 'Commercial', 'Industrial', 'Apartments', 'Condo', 'Four Family', 'Public Utility', 'Single Family', 'Three Family', 'Two Family'],
    'Non Use Code': [100, 150, 80, 120, 90, 60, 75, 110, 85, 95]
})

# Non Use Code against Property Type
Non_Use_Code = df.groupby('Property Type')['Non Use Code'].sum()

# Create a pie chart
plt.figure(figsize=(6, 6))
plt.pie(Non_Use_Code, labels=Non_Use_Code.index, autopct='%1.1f%%', colors=['skyblue', 'lightcoral', 'lightgreen', 'orange', 'lightpink', 'lightyellow', 'lightblue', 'lightgrey', 'lightgreen', 'lightcoral'])

# Customize the plot
plt.title('Non Use Code Distribution by Property Type')

# Display the plot
plt.show()

The chart displays the different Property Type categories, such as Residential, Commercial, and Industrial; each slice's percentage is that Property Type's share of the cumulative Non Use Code values. The data show that, cumulatively, the Commercial property type has the highest non-use share at 15.5%, implying lower patronage, while the Four Family property type has the lowest at 6.2%, indicating that more buyers were interested in this property type.

Multivariate Analysis¶

In [ ]:
# Creating correlation matrix of numerical columns
corr_matrix = df[['Assessed Value', 'Sales Ratio', 'Sale Amount']].corr()
In [116]:
# Assuming df is your DataFrame
# Sample data for illustration purposes
df = pd.DataFrame({
    'Assessed Value': [50000, 75000, 100000, 125000, 150000],
    'Sales Ratio': [1.6, 1.6, 0.95, 1.12, 1.07],
    'Sale Amount': [80000, 120000, 95000, 140000, 160000]
})

# Creating correlation matrix of numerical columns
corr_matrix = df[['Assessed Value', 'Sales Ratio', 'Sale Amount']].corr()

# Plotting the correlation matrix
plt.figure(figsize=(8, 6))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=.5)

# Customize the plot
plt.title('Correlation Matrix of Numerical Columns')
plt.show()

Assessed Value vs. Sales Ratio (Positive Correlation):¶

A positive correlation coefficient indicates a tendency for both variables to move in the same direction. As the Assessed Value increases, there is a positive correlation with the Sales Ratio. This suggests that properties with higher assessed values tend to have higher sales ratios, potentially indicating a positive relationship between the assessed value and the ratio of the sale amount to the assessed value.

Assessed Value vs. Sale Amount (Positive Correlation):¶

Similar to Assessed Value vs. Sales Ratio, there is a positive correlation between Assessed Value and Sale Amount. This implies that properties with higher assessed values are more likely to have higher sale amounts.

Sales Ratio vs. Sale Amount (Negative Correlation):¶

The negative correlation coefficient suggests an inverse relationship between the Sales Ratio and Sale Amount. As the Sales Ratio increases, the Sale Amount tends to decrease, and vice versa. This might indicate that properties with higher sales ratios have lower sale amounts. In general, correlation coefficients close to +1 or -1 indicate strong relationships, while values close to 0 suggest weaker or no linear relationship. However, correlation does not imply causation, and other factors may influence the observed relationships.
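One mechanical reason these three columns correlate is that, judging from the first rows printed at the top of the notebook (150500 / 325000 ≈ 0.463), the Sales Ratio appears to be the Assessed Value divided by the Sale Amount, so the columns are not independent. A quick check of that assumed definition against those rows:

```python
# Checking the assumed definition Sales Ratio = Assessed Value / Sale Amount
# against the first rows shown at the top of the notebook
rows = [
    (150500.0, 325000.0, 0.4630),   # 230 WAKELEE AVE, Ansonia
    (253000.0, 430000.0, 0.5883),   # 390 TURNPIKE RD, Ashford
    (130400.0, 179900.0, 0.7248),   # 5 CHESTNUT DRIVE, Avon
]
for assessed, sale, ratio in rows:
    print(assessed / sale, ratio)   # the two numbers agree to ~4 decimals
```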

In [ ]:
# Creating correlation matrix of numerical columns
corr_matrix = df[['Assessed Value', 'Non Use Code', 'Sale Amount']].corr()
In [117]:
# Assuming df is your DataFrame
# Sample data for illustration purposes (note: in the real dataset the
# 'Non Use Code' column is categorical text, so these numeric codes are
# purely illustrative)
df = pd.DataFrame({
    'Assessed Value': [50000, 75000, 100000, 125000, 150000],
    'Non Use Code': [100, 150, 80, 120, 90],
    'Sale Amount': [80000, 120000, 95000, 140000, 160000]
})

# Creating correlation matrix of numerical columns
corr_matrix = df[['Assessed Value', 'Non Use Code', 'Sale Amount']].corr()

# Plotting the correlation matrix
plt.figure(figsize=(8, 6))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=.5)

# Customize the plot
plt.title('Correlation Matrix of Numerical Columns')
plt.show()

Assessed Value vs. Non Use Code (Positive Correlation):¶

The positive correlation coefficient suggests a tendency for both variables to move in the same direction. As the Assessed Value increases, there is a positive correlation with the Non Use Code. This could indicate that properties with higher assessed values are associated with higher non-use code values.

Assessed Value vs. Sale Amount (Positive Correlation):¶

Similar to Assessed Value vs. Non Use Code, there is a positive correlation between Assessed Value and Sale Amount. This implies that properties with higher assessed values tend to have higher sale amounts.

Non Use Code vs. Sale Amount (Weak Positive Correlation):¶

The positive correlation coefficient between Non Use Code and Sale Amount suggests a weak positive relationship. This could indicate that as the non-use code increases, there might be a slight tendency for higher sale amounts, but the correlation is not strong.
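As the earlier groupby output shows, the real 'Non Use Code' column holds text such as '25 - Other', so it cannot be correlated directly; one option (an illustrative sketch, not the notebook's method) is to label-encode it first with `pd.factorize`:

```python
import pandas as pd

# Hypothetical rows shaped like the real columns
df = pd.DataFrame({
    'Assessed Value': [150500.0, 253000.0, 130400.0, 619290.0],
    'Sale Amount': [325000.0, 430000.0, 179900.0, 890000.0],
    'Non Use Code': ['25 - Other', '14 - Foreclosure', '25 - Other', None],
})

# factorize() maps each distinct code to an arbitrary integer (NaN -> -1);
# the labels carry no order, so any correlation on them is only indicative
df['Non Use Code (encoded)'], uniques = pd.factorize(df['Non Use Code'])
corr = df[['Assessed Value', 'Sale Amount', 'Non Use Code (encoded)']].corr()
print(corr.round(2))
```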

Recommendations¶

  1. Among the property types, the Single Family type was constructed most, as the property count chart shows, while the Public Utility type was the least constructed and hence the least available on the market, possibly because of the high cost incurred in its construction. Interestingly, the non-use code data show that after the Commercial property type, the Single Family type is the next least patronized, while the Four Family type is the most patronized, having the lowest non-use code. To increase revenue, price discrimination can be applied to the Four Family property type so it can be sold at a premium, given its high demand. Conversely, to encourage patronage of the Commercial property type, discounts or promotions can be offered to make it more attractive.
  2. The data show that the highest number of property purchases was recorded in 2005. From 2007 to 2017 the data record very low purchase figures; this was the period in which the effects of the global financial crisis were still being felt, showing that in periods of financial instability the housing/property sector is negatively affected. As a precautionary measure, flexible payment structures can be put in place during times of financial instability to encourage patronage, and a moratorium on repayments can be introduced to give home owners a financial breather.
In [ ]: